{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f72b2f98-a4d1-4024-b402-a15c39a19de9",
   "metadata": {},
   "source": [
    "# Parallel Processing SimpleDirectoryReader"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0457c81-2894-4d87-9a0d-2f83deeb567f",
   "metadata": {},
   "source": [
    "In this notebook, we demonstrate how to use parallel processing when loading data with `SimpleDirectoryReader`. Parallel processing can be useful with heavier workloads i.e., loading from a directory consisting of many files. (NOTE: if using Windows, you may see less gains when using parallel processing for loading data. This has to do with the differences between how multiprocess works in linux/mac and windows e.g., see [here](https://pythonforthelab.com/blog/differences-between-multiprocessing-windows-and-linux/) or [here](https://stackoverflow.com/questions/52465237/multiprocessing-slower-than-serial-processing-in-windows-but-not-in-linux))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8a5f90f-0a56-40e5-a8b6-07ecc57e0b95",
   "metadata": {},
   "outputs": [],
   "source": [
    "import cProfile, pstats\n",
    "from pstats import SortKey"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4113f2b2-ae63-408d-9c92-72e920332013",
   "metadata": {},
   "source": [
    "In this demo, we'll use the `PatronusAIFinanceBenchDataset` llama-dataset from [llamahub](https://llamahub.ai). This dataset is based off of a set of 32 PDF files which are included in the download from llamahub. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d8049d6-c312-49f6-9c60-07edc5a5c284",
   "metadata": {},
   "outputs": [],
   "source": [
    "!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63eba1de-a23f-49fa-85e6-a6709e563a14",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "# define our reader with the directory containing the 32 pdf files\n",
    "reader = SimpleDirectoryReader(input_dir=\"./data/source_files\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f9fe429-693f-4993-ab37-488e67009368",
   "metadata": {},
   "source": [
    "### Sequential Load"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "002dbc1f-64cc-4f23-85aa-bb73ccf554b8",
   "metadata": {},
   "source": [
    "Sequential loading is the default behaviour and can be executed via the `load_data()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0d003c3-df06-47f1-a15b-51dfa4177bd9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4306"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents = reader.load_data()\n",
    "len(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8226dd40-d940-4061-824a-d6cb536fb1de",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wed Jan 10 12:40:50 2024    oldstats\n",
      "\n",
      "         1857432165 function calls (1853977584 primitive calls) in 391.159 seconds\n",
      "\n",
      "   Ordered by: cumulative time\n",
      "   List reduced from 292 to 15 due to restriction <15>\n",
      "\n",
      "   ncalls  tottime  percall  cumtime  percall filename:lineno(function)\n",
      "        1    0.000    0.000  391.159  391.159 {built-in method builtins.exec}\n",
      "        1    0.003    0.003  391.158  391.158 <string>:1(<module>)\n",
      "        1    0.000    0.000  391.156  391.156 base.py:367(load_data)\n",
      "       32    0.000    0.000  391.153   12.224 base.py:256(load_file)\n",
      "       32    0.127    0.004  391.149   12.223 docs_reader.py:24(load_data)\n",
      "     4306    1.285    0.000  387.685    0.090 _page.py:2195(extract_text)\n",
      "4444/4306    5.984    0.001  386.399    0.090 _page.py:1861(_extract_text)\n",
      "     4444    0.006    0.000  270.543    0.061 _data_structures.py:1220(operations)\n",
      "     4444   43.270    0.010  270.536    0.061 _data_structures.py:1084(_parse_content_stream)\n",
      "36489963/33454574   32.688    0.000  167.817    0.000 _data_structures.py:1248(read_object)\n",
      " 23470599   19.764    0.000  100.843    0.000 _page.py:1944(process_operation)\n",
      " 48258569   37.205    0.000   75.145    0.000 _utils.py:200(read_until_regex)\n",
      " 25208954   11.215    0.000   64.272    0.000 _base.py:481(read_from_stream)\n",
      " 18016574   23.488    0.000   49.305    0.000 __init__.py:88(crlf_space_check)\n",
      "  8642699   20.779    0.000   48.224    0.000 _utils.py:14(read_hex_string_from_stream)\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<pstats.Stats at 0x16bb3d300>"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cProfile.run(\"reader.load_data()\", \"oldstats\")\n",
    "p = pstats.Stats(\"oldstats\")\n",
    "p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e818f66-ddd9-4901-9645-3a52f9883c0e",
   "metadata": {},
   "source": [
    "### Parallel Load"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7a3bc0a-a42c-4fd7-8f4e-7e9110b4f4fc",
   "metadata": {},
   "source": [
    "To load using parallel processes, we set `num_workers` to a positive integer value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef51423f-6921-4956-a00f-c26783ba8f32",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = reader.load_data(num_workers=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e984faa7-b849-4d45-aa84-53df67a8e39f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4306"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b545306-ff91-4fe5-952e-b497f8da1efe",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wed Jan 10 13:05:13 2024    newstats\n",
      "\n",
      "         12539 function calls in 31.319 seconds\n",
      "\n",
      "   Ordered by: cumulative time\n",
      "   List reduced from 212 to 15 due to restriction <15>\n",
      "\n",
      "   ncalls  tottime  percall  cumtime  percall filename:lineno(function)\n",
      "        1    0.000    0.000   31.319   31.319 {built-in method builtins.exec}\n",
      "        1    0.003    0.003   31.319   31.319 <string>:1(<module>)\n",
      "        1    0.000    0.000   31.316   31.316 base.py:367(load_data)\n",
      "       24    0.000    0.000   31.139    1.297 threading.py:589(wait)\n",
      "       23    0.000    0.000   31.139    1.354 threading.py:288(wait)\n",
      "      155   31.138    0.201   31.138    0.201 {method 'acquire' of '_thread.lock' objects}\n",
      "        1    0.000    0.000   31.133   31.133 pool.py:369(starmap)\n",
      "        1    0.000    0.000   31.133   31.133 pool.py:767(get)\n",
      "        1    0.000    0.000   31.133   31.133 pool.py:764(wait)\n",
      "        1    0.000    0.000    0.155    0.155 context.py:115(Pool)\n",
      "        1    0.000    0.000    0.155    0.155 pool.py:183(__init__)\n",
      "        1    0.000    0.000    0.153    0.153 pool.py:305(_repopulate_pool)\n",
      "        1    0.001    0.001    0.153    0.153 pool.py:314(_repopulate_pool_static)\n",
      "       10    0.001    0.000    0.152    0.015 process.py:110(start)\n",
      "       10    0.001    0.000    0.150    0.015 context.py:285(_Popen)\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<pstats.Stats at 0x29408ab30>"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cProfile.run(\"reader.load_data(num_workers=10)\", \"newstats\")\n",
    "p = pstats.Stats(\"newstats\")\n",
    "p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3e40fd1-1012-47e1-ab23-3f3159cbb4f2",
   "metadata": {},
   "source": [
    "### In Conclusion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df1cb9ba-7c83-4721-8b1b-799baa65c0dd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13.033333333333333"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "391 / 30"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "653d7018-70a1-4efa-bb82-64d7b3ce8aae",
   "metadata": {},
   "source": [
    "As one can observe from the results above, there is a ~13x speed up (or 1200% speed increase) when using parallel processing when loading from a directory with many files."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_3.10",
   "language": "python",
   "name": "llama_index_3.10"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
